BACKGROUND OF THE INVENTION

The present invention relates to computers, and
more particularly to cache memories in computer
systems.

Present computer systems use multi-port caches to
provide appropriate data flow to execution units of
processors that implement instruction level parallelism
or to multiple processors. It is desirable to provide
faster economical multi-port caches.

SUMMARY 

The present invention provides fast economical
multi-port caches in some embodiments. In some
embodiments, the cache is set associative. If cache
misses occur on more than one port simultaneously,
different replacement sets are chosen for different
cache misses. A separate write port is provided for
each set. Therefore, multiple replacements can proceed
in parallel. In non-blocking cache embodiments, the
performance of a processor or processors using the
cache is therefore increased.

Since each set has its own write port, the set
does not need multiple write ports to allow
simultaneous access for different cache misses. The
cache cost is therefore reduced.

In some embodiments, the sets are divided into
groups of sets. A separate write port (i.e., address
decoder) is provided for each group of sets. A
separate write strobe is provided for each set. If
simultaneous cache misses occur, replacement sets are
selected from different groups. The replacement sets
are updated in parallel. Each group of sets does not
need multiple write ports to allow simultaneous access
for different cache misses. The cache cost is
therefore reduced.

In some embodiments, for each cache entry, a tree
data structure is provided to implement a tree
replacement policy. If only one cache miss occurred,
the search for the replacement set starts at the root
of the tree. If multiple misses occurred
simultaneously, the search starts at a tree level that
has at least as many nodes as there were cache misses.
For each cache miss, a separate node is selected at
that level; the search for the respective replacement
set starts with the selected node.

BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 is a block diagram illustrating a dual-port
cache and a cache replacement policy according to the
present invention.

Fig. 2 is a diagram of a cache block in the cache
of Fig. 1.

Fig. 3 is a diagram of an external memory address
of data in the block of Fig. 2.

Fig. 4 is a block diagram of another cache of the
present invention.

Fig. 5 is a block diagram of a computer system
including a cache of the present invention.

Figs. 6A, 6B are a block diagram of a portion of
the cache of Fig. 5.

Fig. 7 is a block diagram of steps performed by
the cache of Fig. 5.

Figs. 8 and 9 are block diagrams of portions of
the cache of Fig. 5.






WO 98/14951



PCT/RU96/00282 






Fig. 10 is a block diagram of a processor
including a cache of the present invention.

DESCRIPTION OF PREFERRED EMBODIMENTS

Fig. 1 illustrates a double-ported four-way
set-associative non-blocking cache 110. Cache 110 has four
sets 0 through 3, also labeled 120.0 through 120.3.
Each set includes a number of blocks 206 (128 blocks in
some embodiments). As shown in Fig. 2, each block 206
includes a tag 210, a data block 220, and valid bits
230.

Data from external memory are placed in cache 110
as follows. The external memory address 304 (Fig. 3)
of the data is subdivided into three fields 210, 310
and 320. Tag field 210 is stored in block 206. Index
310 determines the address of block 206 in a set 120.i.
The data can be cached in any set 120.i at the slot
corresponding to index 310. Index 310 is also called
an entry number.

Field 320 determines the offset of the data in
data block 220.

All cache blocks 206 having a given entry number
form a "cache entry".

Cache 110 has two ports and thus is suitable for
use in a processor that has two or more channels for
memory access. Examples of such processors are 1) very
large instruction word (VLIW) processors and
2) superscalar processors. Cache 110 is also suitable
for multi-processor systems including single channel
and/or multiple channel processors.

Cache 110 includes memory that stores bits R0, R1,
R2 to implement a tree replacement policy. A separate
triple R0, R1, R2 is provided for each cache entry.
For each entry, bits R0, R1, R2 implement a tree
structure. R1 and R2 are leaf nodes of the tree. The
leaf R1 selects set 0 or set 1 as a replacement set.
More particularly, R1 = 0 selects set 0; R1 = 1 selects
set 1. For each cache entry, R1 selects the LRU (least
recently used) of sets 0 and 1, that is, R1 selects the
LRU of the two cache blocks in the respective entry in
sets 0 and 1.

Similarly, R2 = 0 selects set 2, and R2 = 1
selects set 3. R2 selects the LRU of sets 2 and 3.

R0 = 0 selects the group of sets 0, 1 (group 0).
R0 = 1 selects the group of sets 2, 3 (group 1). For
each cache entry, R0 selects the LRU of groups 0, 1.
This replacement policy is called herein "tree-LRU".

If a cache miss occurs on one, but not both, of
ports 0 and 1, a replacement set is selected as
follows. The cache entry is determined from index 310
of the cache-miss address 304. For this cache entry,
bits R0, R1, R2 are examined. If bit R0 selects
group 0, then the replacement set is selected by bit
R1. If R0 selects group 1, the replacement set is
selected by bit R2.

If a cache miss occurs on both ports 0 and 1
simultaneously (on the same clock cycle), then
different groups of replacement sets are selected for
different ports. The replacement set for port 0 is
selected by bit R1 for the cache entry corresponding to
index 310 on port 0. The replacement set for port 1 is
selected by bit R2 for the index 310 on port 1. Bits
R0 are ignored. Selection of different sets
facilitates simultaneous writing of new information
into the replacement sets. In particular, a single
write port for each set is sufficient to write the
information simultaneously. Moreover, even a single
write port address decoder for each of groups 0 and 1
is sufficient.
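
The selection rule of Fig. 1 can be sketched in a few lines of Python. This is an illustrative model only, not the circuit itself; the function names and the tuple layout (R0, R1, R2) are assumptions made for the example.

```python
def pick_set_single(r0, r1, r2):
    """One miss: R0 picks the group, then R1 or R2 picks the set in it."""
    return r1 if r0 == 0 else 2 + r2


def pick_sets_dual(attrs_port0, attrs_port1):
    """Misses on both ports: port 0 uses R1 (group 0), port 1 uses R2
    (group 1). R0 is ignored, so the two sets are always different.
    Each argument is the (R0, R1, R2) triple of that port's cache entry."""
    _, r1, _ = attrs_port0
    _, _, r2 = attrs_port1
    return r1, 2 + r2
```

Because the two results always fall in different groups, one write port address decoder per group is enough in this model, as the text above states.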

Fig. 4 illustrates a cache 110 in which different
replacement sets are selected for up to N cache misses.
Hence, simultaneous replacements are provided for up to
N cache ports. N can be any integer greater than 1.
The sets are divided into M groups. The replacement
sets are selected using a tree replacement policy.
More particularly, the cache includes trees of data
structures Ri.j, i = 0, ..., k; j = 1, ..., Ni; Nk = M. A
separate tree is provided for each cache entry. If a
single cache miss occurs, the search for the
replacement set starts with the root data structure
R0.1. The search is performed in the tree
corresponding to the cache miss index 310. The root
structure R0.1 selects one of the structures R1.1
through R1.N1 at the next tree level. Each data
structure R1.i selects one of structures R2.1 through
R2.N2 at the following tree level, and so on. Each
leaf Rk.1 through Rk.M selects a replacement set in the
corresponding group 1 through M. The tree search
proceeds from the root to the leaves in a conventional
manner.

If the number m of cache misses occurring in a
given clock cycle is greater than 1 but does not exceed
N1, m tree nodes are selected from nodes R1.1 through
R1.N1. For each cache miss, the selected node is in
the tree corresponding to the cache entry in which the
replacement is to be made. Different selected nodes
R1.j have different "j" parameters. m searches occur
in parallel starting with the selected nodes. Each
search proceeds conventionally in the subtree in which
the selected node is the root. Each search results in
a separate replacement set.

If the number m of simultaneous cache misses is
greater than N1 but does not exceed N2, then m nodes
are selected from the nodes R2.1 through R2.N2, and so
on. The tree searches for m replacement sets start
with the selected nodes.

If the number of simultaneous cache misses is
greater than N(k-1) (the number of immediate parents of
the leaf nodes), the searches start with the leaf
nodes.

Writing to the replacement sets can be done in
parallel if each set has a separate write port.
Writing can be done in parallel even if a single write
port address decoder is provided for each group 1
through M.
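
The multi-level search can be illustrated with a small Python sketch. It rests on several assumptions that are not in the text: the tree is binary and complete, each leaf group holds two sets, nodes are numbered from 0 rather than 1, and miss number p simply starts at node p of the chosen level (the text only requires that the start nodes differ). All function names are hypothetical.

```python
def start_level(level_sizes, m):
    """Return the first tree level that has at least m nodes."""
    for level, size in enumerate(level_sizes):
        if size >= m:
            return level
    raise ValueError("more misses than leaf nodes")


def search(tree, level, j):
    """Walk one tree from node j at the given level down to a set number."""
    leaf_level = len(tree) - 1
    while level < leaf_level:
        j = 2 * j + tree[level][j]      # bit 0: left child, bit 1: right child
        level += 1
    return 2 * j + tree[leaf_level][j]  # leaf bit picks a set inside group j


def replacement_sets(trees):
    """trees[p] is the tree of the cache entry addressed by miss p,
    given as a list of levels, each level a list of selector bits."""
    m = len(trees)
    level = start_level([len(nodes) for nodes in trees[0]], m)
    # Assumed assignment: miss p starts at node p, so start nodes differ.
    return [search(tree, level, p) for p, tree in enumerate(trees)]
```

With the three-bit tree of Fig. 1 written as `[[R0], [R1, R2]]`, a single miss starts at the root, while two misses start at the leaves and land in different groups.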

In some embodiments, cache 110 of Fig. 4 uses a
tree-LRU replacement policy. More particularly, for
each cache entry CE, each leaf node Rk.i selects the
LRU set in the corresponding group of sets. In other
words, each leaf node selects a set having the LRU data
block in the corresponding entry in the corresponding
group of sets. Each non-leaf node NLN selects an LRU
group of sets and hence an LRU group of data blocks.
More particularly, each immediate child of non-leaf
node NLN is a root of a subtree. (The subtree may
contain only one node if the child is a leaf.) All the
leaf nodes of the subtree define a group G of the sets
which are all the sets of all the groups corresponding
to the leaves of the subtree. We will say that the
group G corresponds to the root of the subtree. Thus,
each child corresponds to a group of sets and hence to
a group of blocks in cache entry CE. The non-leaf node
NLN selects one of its immediate child nodes and hence
selects one of the groups of blocks. The selected
group of blocks is the LRU group of blocks.

Fig. 5 is a block diagram of a computer system 510
incorporating one embodiment of cache 110. Cache 110
is a write-through data cache ("DCACHE") internal to a
VLIW RISC processor 520. Processor 520 is shown also
in Fig. 10 and described in the Appendix. Processor
520 includes instruction execution unit (IEU) 530. IEU
530 includes four ALUs (arithmetic logic units) ALU0
through ALU3. The four ALUs provide four parallel
execution channels 0 through 3 for arithmetic and logic
operations. IEU 530 includes four Array Access
Channels AAC0 - AAC3 to generate array element
addresses for loops. AAC0 and AAC2 are used only for
memory load operations (operations that load data from
external memory 550). AAC1 and AAC3 are used both for
load and store operations.

In addition to arithmetic and logic operations,
ALU1 and ALU3 are used to calculate addresses for
scalar memory accesses.

Accordingly, IEU 530 has four channels 0 through 3
for communication with external memory 550 through
external interface 540. Channels 1 and 3 are used both
for reading and writing the memory. These channels go
through cache 110. Channels 0 and 2 are used for
reading only. These channels do not go through cache
110.

In IEU 530, channel 1 includes cache-hit input
CH1, address-valid output V1, virtual-address output
VA1, physical-address output PA1, data output D1, and
data input CD1. Channel 3 includes cache-hit input
CH3, address-valid output V3, virtual-address output
VA3, physical-address output PA3, data output D3, and
data input CD3. Ports CH1, V1, VA1, D1, CD1, CH3, V3,
VA3, D3, CD3 are connected to cache 110. Ports PA1,
PA3 are connected to external interface 540. Data on
outputs D1, D3 are written to cache 110. These data
are also written to memory 550 through external
interface 540 and bus 554.

Channels 0 and 2 are not shown in Fig. 5. In IEU
530, channel 0 includes address-valid output V0 and
physical-address output PA0. Channel 2 includes
address-valid output V2 and physical-address output
PA2. Ports PA0, PA2, V0, V2 are connected to external
interface 540.

Channels 0-3 can be accessed in parallel.
External interface 540 and external memory 550 are
interconnected by bus 554. Bus 554 includes four
bidirectional channels that can access memory 550 in
parallel. To write data to memory 550, the four
channels of bus 554 can be multiplexed onto any one or
more of channels 1 or 3. In particular, each of the
four channels of bus 554 can communicate with one of
the channels 1 or 3.

To read data from memory 550, CPU 520 has four
parallel channels MD0 through MD3. Each channel MDi
communicates with a respective one of the channels of
bus 554. Channels MD0 through MD3 include outputs MD0
through MD3 in external interface 540. These outputs
are connected to respective inputs MD0 through MD3 of
IEU 530 and to respective inputs of cache 110. These
inputs of cache 110 are illustrated in Fig. 9 as inputs
of multiplexers 930.1 through 930.3.

Memory 550 includes a higher level cache in some
embodiments.

Memory control logic external to processor 520 is 
not shown. 

To read memory on channel 1 or 3, IEU 530 drives
the memory virtual address on respective lines VA1 or
VA3 and asserts the respective valid signal V1 or V3.
If a cache hit occurs, cache 110 asserts respectively
CH1 or CH3, and writes data to IEU 530 on respective
lines CD1 or CD3. If a cache miss occurs, cache 110
asserts respective request signal RQ1 or RQ3 to
external interface 540. IEU 530 provides the physical
address on respective lines PA1 or PA3. In response,
data from memory 550 are written to cache 110 and IEU
530 via one or more of the channels MD0-MD3.

Fig. 6, which includes Figs. 6A and 6B, is a
diagram of a tag portion of cache 110. Cache 110 is a
four-way set associative cache. Tag memories 610.0
through 610.3 (Fig. 6B) store tags 210 of respective
sets 0 through 3. Each memory 610.i includes two read
ports and one write port. The address input of one of
the read ports receives index portion I1 of address VA1
from IEU 530. The address input of the other read port
receives index I3 of address VA3.

The outputs TM1, TM3 of memory 610.0 are connected
to inputs of respective comparators 620.0.1, 620.0.3.
The other input of comparator 620.0.1 is connected to
the tag portion T1 of address VA1. The other input of
comparator 620.0.3 is connected to tag portion T3 of
address VA3. Thus, the output signal of comparator
620.0.1 indicates whether T1 is equal to the tag at
entry number I1 in memory 610.0. Similarly, the output
of comparator 620.0.3 indicates whether the tag T3 is
equal to the tag at entry number I3 in memory 610.0.

In the same manner, the outputs TM1, TM3 of each
memory 610.i are connected to inputs of respective
comparators 620.i.1, 620.i.3. The other inputs of the
two comparators are connected respectively to T1, T3.

OR circuit 630.1 generates a signal h1. h1 is the
OR of the outputs of comparators 620.i.1, i = 0, 1, 2,
3. AND gate 632.1 generates CH1 = h1 AND V1. V1 is
the address-valid output of IEU 530. Signal CH1
indicates whether a cache hit occurred on channel 1.
Signal CH1 is delivered to input CH1 of IEU 530.

Similarly, circuit 630.3 generates signal h3 which
is the OR of the outputs of comparators 620.i.3; AND
gate 632.3 generates CH3 = h3 AND V3. Signal CH3
indicates whether a cache hit occurred on channel 3.
Signal CH3 is delivered to input CH3 of IEU 530.

Circuits 630.1, 630.3 also generate respective
signals /h1, /h3 which are the complements of
respective signals h1, h3. "/" before a signal name
indicates a complement. AND gate 634.1 generates RQ1 =
V1 AND /h1. AND gate 634.3 generates RQ3 = V3 AND /h3.
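
The hit and request logic for one channel can be modeled as follows. This is a behavioral sketch only; the function name and argument layout are assumptions, and the effect of the tag validity bits is left out.

```python
def channel_signals(stored_tags, tag, valid):
    """Model comparators 620.i.x, OR circuit 630.x and gates 632.x, 634.x
    for one channel. stored_tags holds the four tags read from memories
    610.0 through 610.3 at the channel's index; valid is 0 or 1."""
    s = [int(t == tag) for t in stored_tags]  # comparator outputs (signal S)
    h = int(any(s))                           # h = OR of the comparators
    ch = h & valid                            # CH = h AND V (cache hit)
    rq = valid & (1 - h)                      # RQ = V AND /h (memory request)
    return s, ch, rq
```

For a valid access exactly one of CH and RQ is asserted in this model; when the address-valid input is 0, both stay deasserted.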
Four-bit signal S1 is composed of the outputs of
four comparators 620.i.1. S1 indicates: 1) whether a
cache hit occurred on channel 1, and 2) if the hit
occurred, in which set it occurred. Similarly, signal
S3 composed of the outputs of four comparators 620.i.3
indicates: 1) whether a cache hit occurred on channel
3; and 2) if the hit occurred, in which set it
occurred. Signals S1, S3 are delivered to attribute
and tag control (ATC) circuit 640 (Fig. 6A).
Attribute memory 650 (Fig. 6A) stores three
attribute bits R0, R1, R2 for each cache entry. Memory
650 has two read ports and two write ports. Indices
I1, I3 are connected to address inputs of the
respective read ports of memory 650. Indices I1, I3
are connected also to the address inputs of the
respective write ports of memory 650.

When the tag memories 610.i are read, attribute
memory 650 is also read on both read ports. The
attributes provided by memory 650 are delivered to ATC
circuit 640.

Comparator 660 compares the tag T1 with the tag T3
and the index I1 with the index I3. Comparator 660
generates: 1) signal TEQ indicating whether T1 = T3;
and 2) signal IEQ indicating whether I1 = I3. Signals
TEQ, IEQ are delivered to ATC circuit 640.

Circuit 640 receives also address-valid signals
V1, V3 from IEU 530.

Write strobe output WS1 and attribute output AT1
of circuit 640 are connected to one write port of
memory 650. Write strobe output WS3 and attribute
output AT3 of circuit 640 are connected to the other
write port of memory 650. When the write strobe
outputs WS1 and/or WS3 are asserted, the attributes on
the respective outputs AT1 and/or AT3 are written to
memory 650 at addresses corresponding to respective
indices I1 and/or I3.
Circuit 640 has four write strobe outputs TWSi
(Fig. 6A) connected to write strobe inputs of
respective memories 610.0 through 610.3. Circuit 640
also has multiplexer control outputs MCi. One of the
outputs MCi is connected to select inputs of
multiplexers 670I.1, 670T.1. The other one of outputs
MCi is connected to select inputs of multiplexers
670I.3, 670T.3. Two data inputs of multiplexer 670I.1
receive respective indices I1, I3. The output of
multiplexer 670I.1 is connected to the address inputs
of the write ports of memories 610.0, 610.1. Two data
inputs of multiplexer 670I.3 receive respective indices
I1, I3. The output of multiplexer 670I.3 is connected
to the address inputs of the write ports of memories
610.2, 610.3.

Two data inputs of multiplexer 670T.1 receive
respective tags T1, T3. The output of multiplexer
670T.1 is connected to the data inputs of the write
ports of memories 610.0, 610.1. Two data inputs of
multiplexer 670T.3 receive respective tags T1, T3. The
output of multiplexer 670T.3 is connected to the data
inputs of the write ports of memories 610.2, 610.3.

To write a tag into memory 610.0 or 610.1, circuit
640 causes multiplexer 670I.1 to select the address I1
or I3. Circuit 640 causes multiplexer 670T.1 to select
the appropriate tag T1 or T3. Circuit 640 asserts a
respective write strobe TWSi. Writing a tag into
memory 610.2 or 610.3 is accomplished similarly via
multiplexers 670I.3, 670T.3. Writing to memory 610.0
or 610.1 can proceed in parallel with writing to memory
610.2 or 610.3.

In a memory access operation, if a cache miss
occurred, the tag write operation is delayed from the
respective tag read. In some embodiments, the tag
write is performed one or more clock cycles later than
the respective tag read; registers 950.1, 950.3 (Fig.
8) are used to delay the tag writes.

If a cache reload from external memory 550 is
needed, the tags and the attributes are written
immediately, before data arrive from memory 550. The
data can arrive in parallel for channels 1 and 3.

Circuit 640 implements the tree-LRU replacement
policy of Fig. 1. Fig. 7 illustrates operation of
circuit 640 when: (a) V1 is asserted to indicate a
memory access on channel 1; and (b) either V3 is
deasserted (no access on channel 3), or V3 is asserted
and the signal IEQ indicates that the indices I1, I3 do
not coincide. Fig. 7 illustrates operations performed
for the index I1. If V3 is asserted, similar
operations are performed in parallel for the index I3.

As shown in Fig. 7, if the signal S1 indicates a
set 0 hit on channel 1 (step 710), circuit 640 writes
the attributes R0 = 1, R1 = 1, R2 = -> to memory 650 at
address I1 (step 714). "->" means that R2 remains
unchanged, that is, the new value of R2 is the old
value read from memory 650.

Similarly, if signal S1 indicates a hit in set 1
(step 720), circuit 640 writes R0 = 1, R1 = 0, R2 = ->
(step 724).

If S1 indicates a hit in set 2 (step 730), circuit
640 writes R0 = 0, R1 = ->, R2 = 1 (step 734). If S1
indicates a hit in set 3 (step 740), circuit 640 writes
R0 = 0, R1 = ->, R2 = 0 (step 744).
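
The attribute updates of steps 714, 724, 734 and 744 can be written as one small function. This is a sketch under the assumption, consistent with Table 1, that an access to set 3 leaves R1 unchanged just as an access to set 2 does; the function name is hypothetical.

```python
def attrs_after_access(set_number, r0, r1, r2):
    """Return the new (R0, R1, R2) after set_number is accessed.
    Each bit is made to point away from the set that was just used."""
    if set_number == 0:
        return (1, 1, r2)   # step 714: R2 unchanged
    if set_number == 1:
        return (1, 0, r2)   # step 724: R2 unchanged
    if set_number == 2:
        return (0, r1, 1)   # step 734: R1 unchanged
    if set_number == 3:
        return (0, r1, 0)   # step 744: R1 unchanged
    raise ValueError("set_number must be 0, 1, 2 or 3")
```

The same function applies after a replacement, since the text performs the corresponding update step in parallel with the tag write.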

If signal S1 indicates a cache miss on channel 1,
and signal S3 indicates a cache miss on channel 3 (step
750), circuit 640 tests the bit R1 for index I1 (step
754). If R1 = 0, the replacement set for channel 1 is
set 0. Under the control of circuit 640, tag T1 is
written to memory 610.0 at address I1 (step 760).

In parallel with step 760, step 714 is performed
to update the attributes as described above.




If R1 = 1 at step 754, tag T1 is written to set 1
(step 764). Step 724 is performed in parallel.

If there was no cache miss on channel 3, that is,
V3 was deasserted or V3 was asserted and a cache hit
occurred on channel 3, circuit 640 tests the bit R0
(step 770) for index I1. If R0 = 0, control passes to
step 754, and the operation proceeds as described
above. If R0 = 1, R2 is tested (step 774). If R2 = 0,
set 2 is the replacement set (step 780). Tag T1 is
written to set 2, and step 734 is performed in
parallel. If R2 = 1, set 3 is the replacement set
(step 784). Tag T1 is written to set 3, and step 744
is performed in parallel.

If V3 is asserted, and either V1 is deasserted or
I1 and I3 do not coincide, the operation of circuit 640
for channel 3 is similar to that illustrated in Fig. 7.
However, if cache misses occur on both channels, then
step 754 is not performed for index I3. Instead, R2 is
tested at step 774. If R2 = 0, steps 780 and 734 are
performed for index I3. If R2 = 1, steps 784 and 744
are performed. Similarly to Fig. 7, step 754 is
performed for index I3 if there is no cache miss on
channel 1 and R0 = 0 for index I3.

If both V1 and V3 are asserted, the tag write
operations for channels 1 and 3 are performed in
parallel. The attributes in memory 650 are also
updated in parallel.

If both V1 and V3 are asserted, the indices I1 and
I3 coincide, but the tags T1 and T3 are different,
circuit 640 operates as follows. If cache hits occur
on both channels 1 and 3, circuit 640 generates new
values for attributes R0, R1, R2 for index I1 = I3 in
accordance with Table 1 below. The first column of
Table 1 shows the sets in which the hits occur. Thus,
in the first line, both hits are in set 0. The new
attribute values are R0 = 1, R1 = 1, R2 = ->. The next
line indicates the new attributes when the cache hits
are in sets 0 and 1, and so on. "*" means "don't
care". The new attributes are written to one of the
write ports of memory 650.

Table 1

  Sets hit     New attrs.
               R0     R1     R2
  --------     --     --     --
  0            1      1      ->
  0, 1         1      *      ->
  0, 2         *      1      1
  0, 3         *      1      0
  1            1      0      ->
  1, 2         *      0      1
  1, 3         *      0      0
  2            0      ->     1
  2, 3         0      ->     *
  3            0      ->     0



Table 2 shows the operation of circuit 640 when
the indices I1 and I3 coincide, a hit occurs on one of
channels 1 and 3 and, simultaneously, a miss occurs on
the other one of channels 1 and 3. The first column
shows the set in which the hit occurred. The third
column shows the replacement set for the channel on
which a miss occurred. The next two columns show the
new values for attributes R1, R2 for the index I1. R0
is "don't care".

The second column shows the attribute tested to
determine the replacement set and also to determine the
new attribute values. For example, if the hit occurred
in set 0, R2 is tested. If R2 = 0, the replacement set
is set 2, and the new attribute values are R0 = *
("don't care"), R1 = 1, R2 = 1. If R2 = 1, the
replacement set is 3, and the new attributes are
R0 = *, R1 = 1, R2 = 0. The new attributes are written
to one of the ports of memory 650.



Table 2

  Set     Old        Rep.     New attrs.
  hit     attr.      set      R1     R2
  ---     ------     ----     --     --
  0       R2 = 0     2        1      1
          R2 = 1     3        1      0
  1       R2 = 0     2        0      1
          R2 = 1     3        0      0
  2       R1 = 0     0        1      1
          R1 = 1     1        0      1
  3       R1 = 0     0        1      0
          R1 = 1     1        0      0



Table 3 illustrates the operation when cache
misses occur on both ports, I1 = I3, and T1 is not
equal to T3. The replacement sets and the new
attribute values depend on the values of attributes R1,
R2 listed in the first two columns of Table 3. The
third column shows the replacement sets. The first
replacement set is for channel 1. This set is
determined by attribute R1. The second replacement
set, for channel 3, is determined by attribute R2. The
new attributes R1, R2 are shown in the last two
columns. R0 is "don't care". The new attributes are
written to one of the write ports of memory 650.
Table 3

  Old attrs.     Rep.     New attrs.
  R1     R2      sets     R1     R2
  --     --      ----     --     --
  0      0       0, 2     1      1
  0      1       0, 3     1      0
  1      0       1, 2     0      1
  1      1       1, 3     0      0
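
Tables 2 and 3 follow a regular pattern that a short Python sketch can capture. The function names are hypothetical, and R0 is omitted because both tables list it as "don't care".

```python
def hit_and_miss_same_index(hit_set, r1, r2):
    """Table 2: one channel hits in hit_set, the other misses, same index.
    Returns (replacement_set, new_R1, new_R2)."""
    if hit_set in (0, 1):
        # Hit is in group 0, so the miss is replaced in group 1 via R2.
        return 2 + r2, 1 - hit_set, 1 - r2
    # Hit is in group 1 (set 2 or 3), so the miss is replaced in group 0 via R1.
    return r1, 1 - r1, 3 - hit_set


def both_miss_same_index(r1, r2):
    """Table 3: both channels miss at the same index.
    Returns ((set_for_channel_1, set_for_channel_3), (new_R1, new_R2))."""
    return (r1, 2 + r2), (1 - r1, 1 - r2)
```

In both cases each leaf bit ends up pointing away from the set that was just hit or replaced in its group, which is the tree-LRU rule applied within each group.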



Figs. 8 and 9 show other details of cache 110 of
Fig. 5. Cache 110 is a write-through 32 Kbyte cache
with 128 entries. Each data block 220 (Fig. 2) is 64
bytes wide. Each data port D1, D3, CD1, CD3 and MD0
through MD3 (Fig. 5) is 64 bits wide. The word size is
32 bits. The cache access time is one clock cycle.

Each tag 210 (Fig. 3) includes: 1) bits [47:13]
of the virtual address, and 2) context bits [11:0].
Index 310 includes bits [12:6] of the virtual address.
Block offset 320 includes bits [5:0] of the virtual
address. Bits [5:3] define the double word being
accessed. Bits [2:0] define a byte in the double word.
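
As an illustration of this field layout, the following Python sketch splits a 48-bit virtual address into its fields. The function name and the returned dictionary keys are assumptions for the example, and the 12 context bits that also belong to the tag are not modeled.

```python
def split_address(va):
    """Split a virtual address into the fields described above."""
    offset = va & 0x3F                       # bits [5:0], block offset 320
    return {
        "tag_va": (va >> 13) & ((1 << 35) - 1),  # bits [47:13], part of tag 210
        "index": (va >> 6) & 0x7F,               # bits [12:6], index 310
        "offset": offset,
        "double_word": offset >> 3,              # bits [5:3]
        "byte": offset & 0x7,                    # bits [2:0]
    }
```

The 7-bit index addresses 128 entries and the 6-bit offset covers a 64-byte block, so four sets give 4 x 128 x 64 bytes = 32 Kbytes, matching the stated cache size.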

Fig. 9 illustrates data memories 910.0 through
910.3 that hold data blocks 220. Each memory 910.i
holds data for the respective set 120.i.

Each memory 910.i is divided into four sections as
shown by vertical lines in Fig. 9. The four sections
correspond to four respective channels MD0-MD3. Each
section has a separate write port. Four sections can
be written from four respective channels MD0-MD3 in
parallel.

Each section holds two double words of each block
220 in the respective set. For each block 220, its
eight double words 0 through 7 are arranged as shown
for memory 910.0. More particularly, double words 0
and 4 are in section 0, double words 1 and 5 are in
section 1, double words 2 and 6 are in section 2, and
double words 3 and 7 are in section 3. The section is
identified by bits [4:3] of the virtual address.
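
The mapping of double words to sections can be checked with a brief Python sketch; the function name is an assumption.

```python
def section_of(va):
    """Section number taken from bits [4:3] of the virtual address."""
    return (va >> 3) & 0x3


# Double words 0..7 of a block start at byte offsets 0, 8, ..., 56.
layout = {s: [dw for dw in range(8) if section_of(dw * 8) == s]
          for s in range(4)}
```

Here `layout` reproduces the arrangement described above: each section holds double word n and double word n + 4.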

The 64-bit data inputs of the write ports of
sections 0 of all memories 910.i are connected to the
output of register 920.0. Similarly, the data inputs
of the write ports of all sections 1 are connected to
the output of register 920.1. The data inputs of the
write ports of all sections 2 are connected to the
output of register 920.2. The data inputs of the write
ports of all sections 3 are connected to the output of
register 920.3. Each register 920.i is 64 bits wide.
The input of each register 920.i is connected to the
output of respective multiplexer 930.i. Each
multiplexer 930.i has three data inputs connected
respectively to: 1) port D1 of IEU 530, 2) port D3 of
IEU 530, and 3) port MDi of external memory 550
(Fig. 5).

Multiplexers 930.i are controlled by data cache
control unit 940 (Fig. 8). Unit 940 includes circuits
640, 630.1, 630.3, 632.1, 632.3, 634.1, 634.3 (Fig. 6).
Four different sections 0, 1, 2, 3 can be written
simultaneously from registers 920.i. The four sections
can be in the same memory 910.i or in different
memories. When a memory 910 is accessed, index 310 and
block offset 320 are supplied to the memory's address
input. Unit 940 provides a separate write strobe for
each section. One, two, three or four sections can be
written at a time.

Loading data from external memory 550 to memories
910 is called a reload operation. Data are not
necessarily reloaded in the order in which the data
words appear in memory 550. In particular, if a reload
was caused by a load operation, then the data requested
by the load are reloaded first. If the requested data
are not at the beginning of block 220, the data at the
beginning of block 220 can be loaded later.
For each set 120.i, cache 110 includes also the
following memories. These memories are shown in Fig. 8
for set 120.0 only:

1) V_TAG includes a tag validity bit for each
tag in the respective set 120.i. The V_TAG memory has
two read ports and two write ports. One read port and
one write port are provided for each of channels 1 and
3.

2) V_DATA has 8 bits [0:7] for each data block
220 in the respective set. Each of the 8 bits
indicates whether a respective double word in the data
block is valid. V_DATA has three read ports and three
write ports. One read port and one write port are
provided for each of channels 1 and 3. In addition, a
read port is provided for a reload operation to check
if data has been already updated by a store issued
after the reload request. If data has been updated
before the cache is reloaded, the reload of the
respective double word is aborted. Also, a write port
is provided to set V_DATA bits in a reload operation.

3) W_DATA ("wait data") has a bit for each data
block in the set to indicate if the entire data block
220 has been written in a reload operation. The W_DATA
memory has two read ports and six write ports. One
read port and one write port are provided for each of
channels 1 and 3. In addition, four write ports are
provided for the four channels MD0 through MD3 in order
to reset the W_DATA attributes at the end of a reload
operation since in a reload the last double word of the
block may come from any memory channel.

The outputs of memories V_DATA and W_DATA are 
connected to unit 940. 

The channel-1 output of memory V_TAG of each set
120.i is connected to respective comparator 620.i.1.
The channel-3 output of V_TAG is connected to
respective comparator 620.i.3. If a V_TAG output shows
an invalid tag, the output of the respective comparator
indicates that the comparator inputs do not match.

Fig. 8 shows registers 950.1, 950.3 omitted for
simplicity from Fig. 6. In Fig. 8, multiplexer 670.1
is a combination of multiplexers 670I.1, 670T.1 of
Fig. 6B. Multiplexer 670.3 is a combination of
multiplexers 670I.3, 670T.3 of Fig. 6B. The outputs of
multiplexers 670.1, 670.3 are connected to respective
registers 950.1, 950.3. The output of register 950.1
is connected to memories 610.0, 610.1. The output of
register 950.3 is connected to memories 610.2, 610.3.

All registers 950.i, 920.j (Fig. 9) are clocked by
the same clock.

Each memory 910.i has two read ports for
respective channels 1 and 3. Both read ports can be
read simultaneously. The outputs of the channel-1 read
ports of memories 910.i are connected to the respective
four data inputs of multiplexer 960.1. The channel-3
outputs are connected to respective data inputs of
multiplexer 960.3. The select inputs of multiplexers
960.1, 960.3 are connected to respective outputs S1, S3
of comparators 620.i.j (Fig. 6B). The output of
multiplexer 960.1 is connected to input CD1 of IEU 530.
The output of multiplexer 960.3 is connected to input
CD3 of IEU 530. The data on channels 1 and 3 can be
provided by memories 910 simultaneously.
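
The hit detection and data selection described above can be
modeled in a few lines of Python. This is only an illustrative
sketch, not the circuit itself: the function names compare and
select_data, the four-set example, and the tag and data values
are assumptions introduced here.

```python
from typing import Optional, Sequence


def compare(tag_valid: bool, stored_tag: int, request_tag: int) -> bool:
    """Model of one comparator 620.i.j: an invalid tag never matches."""
    return tag_valid and stored_tag == request_tag


def select_data(hits: Sequence[bool], set_outputs: Sequence[int]) -> Optional[int]:
    """Model of multiplexer 960.j: pass the output of the set that hit."""
    for hit, data in zip(hits, set_outputs):
        if hit:
            return data
    return None  # no comparator matched: cache miss on this channel


# Four sets; set 1 holds the requested tag but its V_TAG bit is 0.
tags = [(True, 0x10), (False, 0x2A), (True, 0x2A), (True, 0x33)]
hits = [compare(valid, tag, 0x2A) for valid, tag in tags]
data = select_data(hits, [100, 101, 102, 103])
```

Running the same two functions once per channel, with separate
hit vectors, corresponds to the two read ports being read
simultaneously.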

When cache 110 needs to issue a request to access
external memory 550 (to perform a memory store or a
reload), unit 940 asserts signals on output RQ1 (Fig.
5) for channel 1 or output RQ3 for channel 3. If cache
misses occurred on channels 1 and 3 simultaneously, the
requests to access memory 550 are issued on outputs
RQ1, RQ3 (i.e., on channels 1 and 3) simultaneously if
they relate to different data blocks. If both cache
misses are in the same data block, one request for a
data block is issued to memory 550 on one of channels 1
and 3, using the respective one of outputs RQ1, RQ3.
In response, memory 550 returns the double word in
which one of the cache misses occurred. This double
word is loaded into cache 110 and register file RF.
The other 7 double words are returned at the same time
or later. In parallel with the data block request on
one of channels 1 and 3, the other one of channels 1
and 3 is used to request the double word in which the
other cache miss occurred. The double word for the
other cache miss is loaded into the register file RF
(Fig. 10) in IEU 530. The parallel requests on
channels 1 and 3 facilitate making the cache
non-blocking and serve to increase the processor
performance in non-blocking cache embodiments. In
non-blocking embodiments, a cache miss on channel 1 or 3
does not prevent a concurrent cache access on the other
one of channels 1 and 3; also, if a cache miss occurs
on channel 1 or 3, succeeding accesses to the cache on
the same channel are not blocked; these accesses can
proceed while data are reloaded in response to the
cache miss.
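
The request policy for simultaneous misses can be sketched as
follows. The sketch is hypothetical: the function plan_requests,
the use of byte addresses, and the choice of channel 1 for the
block request are assumptions made for illustration. The block
size of 64 bytes assumes 8 double words of 64 bits each.

```python
DOUBLE_WORD_BYTES = 8   # a 64-bit double word
BLOCK_BYTES = 64        # assumed: 8 double words per data block


def plan_requests(miss1, miss3):
    """Return {channel: request} for misses on channels 1 and 3.

    miss1 and miss3 are byte addresses of the missed double words,
    or None if the channel had no miss.
    """
    requests = {}
    if miss1 is not None:
        requests[1] = ("block", miss1 // BLOCK_BYTES)
    if miss3 is not None:
        same_block = (miss1 is not None
                      and miss1 // BLOCK_BYTES == miss3 // BLOCK_BYTES)
        if same_block:
            # One block reload suffices; the other channel fetches only
            # its own double word, which goes to the register file.
            requests[3] = ("double_word", miss3 // DOUBLE_WORD_BYTES)
        else:
            requests[3] = ("block", miss3 // BLOCK_BYTES)
    return requests
```

For example, misses at 0x100 and 0x208 fall in different blocks
and produce two block requests, while misses at 0x100 and 0x108
share a block and produce one block request and one double word
request.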

Unit 940 also receives a memory response for
channels MD0-MD3. The memory response includes the
index and the set number for the cache 110. The index
and the set number are sent to memory 550 with a memory
request. The index and the set number are returned by
memory 550 with the data.

If a cache reload is caused by a load operation,
the corresponding tag valid bit V_TAG and wait data bit
W_DATA are set to 1, and the data valid bits
V_DATA[0:7] are set to 0 for the corresponding data block.
External interface 540 sends to memory 550 a request
for 8 words, a DCACHE data field flag (this flag means
a request for a block of 8 words for cache 110), the
respective index I1 or I3, and the replacement set
number (0, 1, 2, or 3). As data come from memory 550,
the corresponding V_DATA bits are set to 1. The data
can be read from cache 110 as soon as they are written
from memory 550, before the entire block is written.
When the whole block is written, the corresponding
W_DATA bit is set to 0. If a load operation gets a
cache hit but the corresponding V_DATA bit is 0, a
request for one double word goes to memory 550.
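
The life cycle of the attribute bits during a reload caused by a
load can be illustrated with a small state model. This is a
sketch under stated assumptions: the class BlockState and its
method names are hypothetical, and clearing W_DATA when all eight
V_DATA bits are set is a simplification of the hardware, which
resets W_DATA when the last double word arrives.

```python
class BlockState:
    """Attribute bits of one data block (8 double words assumed)."""

    def __init__(self):
        self.v_tag = 0           # tag valid bit
        self.w_data = 0          # 1 while a reload is in progress
        self.v_data = [0] * 8    # one valid bit per double word

    def start_load_reload(self):
        """A load miss selected this block as the replacement block."""
        self.v_tag, self.w_data = 1, 1
        self.v_data = [0] * 8

    def write_double_word(self, i):
        """Double word i has arrived from memory."""
        self.v_data[i] = 1       # readable as soon as it is written
        if all(self.v_data):
            self.w_data = 0      # whole block written

    def can_read(self, i):
        """A hit can be served only if the double word is valid."""
        return self.v_tag == 1 and self.v_data[i] == 1
```

A hit for which can_read returns False corresponds to the case in
which a request for one double word goes to memory.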

In a memory store operation, a byte, a half word,
a word or a double word is written to memory 550 and,
in case of a cache hit, to cache 110. In a double word
store, the double word and the tag are also written to
cache 110 in case of a cache miss. The corresponding
bits V_TAG, W_DATA and V_DATA are set to 1. The
remaining seven V_DATA bits are set to 0. A request
for seven words is issued to memory 550.
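
As an illustration only, the attribute bits after a double word
store that misses can be written out as below. The function name
double_word_store_miss, the dictionary layout, and the block size
of 8 double words are assumptions of this sketch.

```python
def double_word_store_miss(dw_index, n=8):
    """Attribute bits after a double word store that misses.

    dw_index is the position of the stored double word in its block;
    n is the assumed number of double words per block.
    """
    return {
        "V_TAG": 1,
        "W_DATA": 1,
        # Only the stored double word is valid in the cache.
        "V_DATA": [1 if i == dw_index else 0 for i in range(n)],
        # The remaining double words are requested from memory.
        "requested": [i for i in range(n) if i != dw_index],
    }


state = double_word_store_miss(2)
```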

If store operations are performed simultaneously
on channels 1 and 3, and they hit the same section or
they hit sections having the same section number, then
the cache data corresponding to one of the two store
operations is invalidated. Invalidations are performed
by resetting the corresponding bits in the V_DATA
memory.

A data block can be replaced only if its W_DATA is
0. The replacement block is selected from the blocks
having W_DATA = 0. If such a block is not found, the
data are not cached.
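
The eligibility rule can be sketched as a filter applied to
whatever order the replacement policy prefers. The function
choose_replacement and its preference argument are hypothetical;
the sketch does not model how the preference order is produced.

```python
def choose_replacement(w_data, preference):
    """Pick a replacement set among blocks with W_DATA = 0.

    w_data[i] is the W_DATA bit of the candidate block in set i.
    preference lists set numbers, most replaceable first (for
    example, least recently used first); this order is an
    assumption of the sketch.
    """
    for s in preference:
        if w_data[s] == 0:
            return s
    return None  # no eligible block: the data are not cached
```

With w_data = [1, 0, 0, 1] and preference [0, 3, 2, 1], sets 0
and 3 are skipped because they are still being reloaded, and
set 2 is chosen.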

Processor 520 includes a memory management unit
(MMU) which includes a 4-port data translate look-aside
buffer (DTLB) to speed up virtual-to-physical address
translation. TLBs are known in the art. See, for
example, B. Catanzaro, "Multiprocessor System
Architectures" (Sun Microsystems, Inc. 1994) hereby
incorporated herein by reference, at page 96. Unit 940
receives MMU signals for channels 1 and 3. In
addition, unit 940 receives the following signals for
channels 1 and 3:

1) TLB_hit indicating whether DTLB was hit
during the channel access.

2) CACHEABLE indicates whether the channel data
can be cached.

3) GLOBAL - If this flag is set, the context
fields in tag memories 610 and in virtual addresses 
VA1, VA3 are ignored during the tag search. 

4) VECTOR indicates whether the channel data are 
vector or scalar. Cache 110 is used only for scalar 
data. 

If cache 110 is hit and the DTLB is missed, the
cache location is invalidated.

Two or more virtual addresses can be mapped to the 
same physical address. This is called aliasing. To 
maintain cache consistency, page table entries contain 
an alias attribute which shows if the virtual page has 
an alias. DTLB entries have an alias mark showing if 
the corresponding pages have an alias. If virtual 
pages are aliases of one another, their data are cached 
in the same set. Of note, index 310 (Fig. 3) is a
subset of a page offset. Therefore, data from a given 
physical location in a page that has aliases is always 
cached in the same location in cache 110. 
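
The reason aliases land on the same cache index can be shown
with a short calculation. The page size, block size and number
of entries below are assumed values chosen only so that the index
bits lie inside the page offset; the source does not state them.

```python
PAGE_SIZE = 4096     # assumed page size in bytes
BLOCK_SIZE = 64      # assumed data block size in bytes
NUM_ENTRIES = 64     # assumed; BLOCK_SIZE * NUM_ENTRIES <= PAGE_SIZE


def cache_index(address: int) -> int:
    """Index taken entirely from the page-offset bits of the address."""
    return (address % PAGE_SIZE) // BLOCK_SIZE % NUM_ENTRIES


# Two virtual pages mapped to the same physical page: a given
# physical location has the same page offset in both.
va_a = 0x0001_2000 + 0x345
va_b = 0x0007_9000 + 0x345
same_index = cache_index(va_a) == cache_index(va_b)
```

Because translation leaves the page offset unchanged, the index
is the same for every alias. Placing aliases in the same set, as
described above, then fixes the cache location completely.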

When an alias is created and an alias attribute is 
turned on in a page table, software is responsible for 
flushing cache 110. 

While the invention was illustrated with respect
to the embodiments described above, the invention is
not limited by these embodiments. In particular, the
invention is not limited by the type of information
cached in the cache. Some cache embodiments store both
instructions and data, or only instructions. Vector
data are cached in some cache embodiments. In some
embodiments, the cache is accessed using physical
rather than virtual addresses. In some embodiments,
the cache is fully associative: data can be cached in
any cache entry. The invention is not limited to
write-through caches or to LRU type replacement
policies. Other embodiments and variations are within
the scope of the invention, as defined by the appended
claims.

APPENDIX

VLIW CPU 520 of Fig. 10 uses Instruction Level
Parallelism (ILP) to ensure high performance. The
compiler can plan CPU work in each cycle. CPU 520 can
execute concurrently a few simple independent
instructions (operations) that constitute a wide
instruction (load, store, add, multiply, divide, shift,
logical, branch, etc.). Wide instructions are stored
in memory and in an instruction cache (ICACHE) in
packed form as sets of 16- and 32-bit syllables. An
operation can occupy part of a syllable, a whole
syllable, or several syllables.

CPU 520 contains an Instruction Buffer (IB), a
Control Unit (CU), a multiport Predicate File (PF), a
multiport Register File (RF), a Calculate Condition
Unit (CCU), a Data Cache 110 (DCACHE), four Arithmetic
Logic Units (ALU0 - ALU3), an Array Prefetch Buffer
(APB), four Array Access Channels (AAC0 - AAC3), a
Memory Management Unit (MMU) and a Memory Access Unit
(MAU).

The Instruction Buffer (IB) contains 2048 64-bit
double words and is divided into 16 sectors. Program
code and data are accessed using virtual memory. IB
has a separate Instruction Translate Lookaside Buffer
(ITLB) with 32 entries. IB filling is initiated by
hardware for sequential instruction flow when
sequential instructions are exhausted in IB and by a
program when a prepare control transfer operation is
executed. IB performs program code filling for three
branches. In the case of an IB miss the program code
is loaded from memory by 4 memory access channels in
parallel (4 64-bit double words simultaneously).
Control Unit (CU) reads from IB and dispatches one
maximum size wide instruction (9 64-bit double words)
every cycle. The Control Unit generates an unpacked
form of a wide instruction, converts indirect based
operand addresses for a wide instruction to absolute
register file addresses, and checks the following
conditions for the wide instruction:

- no exceptions,
- no interlock conditions from other units of CPU,
- operands availability in RF.

CU issues wide instruction's operations for
execution and performs the following:

- reads up to 10 operands from RF to ALU0 - ALU3,
- reads up to 3 predicate values from PF to CU
  as condition code for control transfer operations,
- reads up to 8 predicate values from PF to CCU
  for new predicate values calculation and
  generation of a mask of conditional execution
  of operations in ALU0 - ALU3 and AAC0 - AAC3,
- issues literal values to ALU0 - ALU3 and
  AAC0 - AAC3,
- issues up to 4 operations to ALU0 - ALU3,
- issues up to 4 operations to AAC0 - AAC3,
- issues up to 11 operations to CCU,
- issues a prepare control transfer operation
  to CU,
- checks the possibility of the execution of
  three control transfer operations in CU.

The Predicate File (PF) is a storage of predicate
values generated by integer and floating point compare
operations. Predicate values are used to control the
conditional execution of operations. The Predicate
File contains 32 two-bit registers.

The Calculate Condition Unit (CCU) generates a
mask for the conditional execution of ALUi and AACi
operations and calculates values of the secondary
predicate as a function of the primary predicates.

The Register File (RF) contains 256 66-bit
registers and has 10 read ports and 8 write ports. All
10 read ports are used to read ALU operands and 2 read
ports are used to read values to DCACHE 110 and MMU
when these values are being written to memory. 4 write
ports are used to write ALU results and the other 4 write
ports are used to write values loaded from memory.

ALU0 - ALU3 are 4 parallel execution channels and
have almost the same sets of arithmetic and logic
operations. In addition, ALU1 and ALU3 are used to
calculate addresses for scalar memory accesses. All
ALUs get their operands from RF and via a bypass. The
bypass reduces the time of delivery of ALU operation
results to subsequent operations. ALU0 and ALU2 get 2
operands and ALU1 and ALU3 get 3 operands because they
can execute combined 3-argument operations. ALU
operation results are written to RF through 4 RF write
channels.

The Array Access Channels AAC0 - AAC3 are 4
parallel channels for generation of array element
addresses for loops. Each AACi contains 8 pairs of
address registers. Each pair includes a current
address register and an increment register. All AACi
have the same operation set: the current array element
address generation (with or without the next element
address calculation). For memory accesses, one pair of
address registers in each channel is used in every
cycle. AAC0 and AAC2 are used only for load memory
accesses, AAC1 and AAC3 are used for load and store
memory accesses.
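
The behavior of one address register pair can be modeled as
below. The class AddressPair, its method generate, and the
example base address and stride are assumptions made for
illustration; the sketch covers only the "with or without the
next element address calculation" choice.

```python
class AddressPair:
    """One (current address, increment) register pair of an AACi."""

    def __init__(self, current: int, increment: int):
        self.current = current
        self.increment = increment

    def generate(self, advance: bool = True) -> int:
        """Return the current element address; optionally step to the next."""
        address = self.current
        if advance:
            self.current += self.increment
        return address


# Walk an array of 8-byte elements starting at an assumed base address.
pair = AddressPair(current=0x1000, increment=8)
addresses = [pair.generate() for _ in range(3)]
```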

The Memory Management Unit contains a 4-port Data
Translate Lookaside Buffer (DTLB) with 64 entries and
performs hardware searches in a Page Table in DTLB miss
cases. In addition, MMU contains Disambiguation Memory
for checking latencies of load and store operations.

The Memory Access Unit contains an entry buffer
for memory requests and a cross bar of 4 data and 1
group IB memory access channels to 4 physical memory
channels. The 2 least significant bits of physical
addresses are the physical memory channel number.
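
The channel mapping amounts to masking the two low-order address
bits. In the sketch below the function name memory_channel is
hypothetical, and the address is assumed to be expressed in
whatever unit the Memory Access Unit uses for physical addresses,
which the text does not specify.

```python
def memory_channel(physical_address: int) -> int:
    """Physical memory channel (0-3) from the two least significant bits."""
    return physical_address & 0b11


# Consecutive addresses are spread over the four channels in turn.
channels = [memory_channel(a) for a in range(6)]
```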

The DCACHE 110 output is combined with the ALU
output. This permits using the bypass to reduce data
transfer to ALUs.

The Array Prefetch Buffer is used to prefetch
array elements for loops from memory. APB is a
4-channel FIFO buffer. APB contains 4x43 66-bit
registers. Data are transferred from APB to RF when
ready.

CPU 520 has 4 memory access channels. Each
channel has a 64-bit data path.
"MX" means a multiplexer.

CLAIMS

1. A multi-port cache system comprising:

a plurality of sets, each set comprising a memory
for caching one or more units of information;

a memory for storing one or more data trees for
selecting, from the sets, replacement sets in which
units of information are to be cached, wherein each
leaf node in each tree corresponds to a group of one or
more of the sets, and each leaf node is for selecting a
replacement set in the corresponding group of the sets,
wherein each tree is suitable for being searched from
any node to a leaf node to select a replacement set,
each non-leaf node to specify its child node to which
the search is to proceed;

a plurality of ports for accessing the cache; and

a circuit for determining a number U1 of new units
of information that are to be cached in response to
cache misses occurring simultaneously on one or more of
the ports, and for searching one or more of the trees
for at least N1 replacement sets to cache the U1 units
of information, wherein N1 > 0, and wherein if U1 > 1
then N1 > 1 and the circuit starts a search for each of
N1 replacement sets from a separate one of the tree
nodes.

2. The cache system of Claim 1 wherein each
group of sets comprises at least one write port to
write to one or more sets of the group, wherein writing
to different write ports can proceed simultaneously.

3. The cache system of Claim 1 wherein each set
comprises a write port, and writing to different sets
through their respective write ports can proceed
simultaneously.

4. The cache system of Claim 1 wherein the
groups corresponding to different leaf nodes of any one
of the trees do not intersect.

5. The cache system of Claim 1 wherein N1 = U1
and the number of ports does not exceed the number of
leaf nodes in any one of the trees.

6. The cache system of Claim 1 wherein each set 
comprises a plurality of slots, each slot for storing a 
block of information, wherein all the slots having the 
same address in all the sets form an entry, and 

the one or more trees comprise a separate 
data tree for each entry. 



7. The cache system of Claim 6 wherein:

in each data tree, each leaf node is to select the
least recently used slot in the corresponding entry;
and

each non-leaf node corresponds to a group of sets
which are all the sets in all the groups corresponding
to all leaf children of the non-leaf node, and the
non-leaf node defines a group of slots which are all the
slots in the corresponding group of sets in the
corresponding entry, and each non-leaf node is to
specify its immediate child node defining the least
recently used group of slots among all the groups
defined by the immediate children of the non-leaf node.

8. A computer system comprising the cache of
Claim 1 and one or more instruction execution channels,
wherein each execution channel is connected to a
separate one of the ports for accessing the cache.

9. A method for providing a multi-port cache
system, the method comprising:

providing a plurality of sets, each set comprising
a memory for caching one or more units of information;

providing a memory for storing one or more data
trees for selecting, from the sets, replacement sets in
which units of information are to be cached, wherein
each leaf node in each tree corresponds to a group of
one or more of the sets, and each leaf node is for
selecting a replacement set in the corresponding group
of the sets, wherein each tree is suitable for being
searched from any node to a leaf node to select a
replacement set, each non-leaf node to specify its
child node to which the search is to proceed;

providing a plurality of ports for accessing the
cache; and

providing a circuit for determining a number U1 of
new units of information that are to be cached in
response to cache misses occurring simultaneously on
one or more of the ports, and for searching one or more
of the trees for at least N1 replacement sets to cache
the U1 units of information, wherein N1 > 0, and
wherein if U1 > 1 then N1 > 1 and the circuit starts a
search for each of N1 replacement sets from a separate
one of the tree nodes.

10. A method for caching information in a
multi-port cache comprising a plurality of sets stored in a
memory, the method comprising:

selecting M nodes in one or more tree data
structures stored in a memory, where M is a number of
cache misses that occurred simultaneously;

for each selected node, searching a tree of
children of the selected node to determine a leaf node;

for each leaf node determined as a result of a
search, using a set selected by the leaf node as a
replacement set for a respective cache miss.

11. The method of Claim 10 wherein M > 1 and the
method further comprises simultaneous writing to the
replacement sets to update the cache.

12. The method of Claim 11 wherein each set 
comprises a write port, and simultaneous writing to the 
replacement sets proceeds through a plurality of the 
write ports of the replacement sets. 

13. The method of Claim 10 wherein each set 
comprises a tag memory comprising a single write port, 
and simultaneous writing to the replacement sets 
comprises simultaneous writing of tags through a 
plurality of the write ports of the tag memories. 
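
By way of illustration only, the tree search recited in the
claims can be sketched for a cache with four sets in two groups.
The tree layout, the node names "root", "leaf0" and "leaf1", the
one-bit node encoding, and the function names are assumptions of
this sketch and do not limit the claims.

```python
# Assumed layout: the root chooses between two leaf nodes, and each
# leaf node chooses between the two sets of its group.
CHILDREN = {"root": ("leaf0", "leaf1")}
SETS = {"leaf0": (0, 1), "leaf1": (2, 3)}


def search(tree, node):
    """Walk from `node` down to a leaf; return the selected set number.

    tree maps each node name to one bit: for a non-leaf node the bit
    names the child to visit, for a leaf node it names the set.
    """
    while node in CHILDREN:
        node = CHILDREN[node][tree[node]]
    return SETS[node][tree[node]]


def replacement_sets(tree, misses):
    """One miss starts at the root; two misses start at separate leaves."""
    starts = ["root"] if misses == 1 else ["leaf0", "leaf1"]
    return [search(tree, n) for n in starts[:misses]]


tree = {"root": 1, "leaf0": 0, "leaf1": 1}
one_miss = replacement_sets(tree, 1)
two_misses = replacement_sets(tree, 2)
```

With two simultaneous misses the searches start at separate leaf
nodes, so the two replacement sets come from different groups and
can be written in parallel through separate write ports.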


